Abstract. We propose automated techniques for the verification and
control of probabilistic real-time systems that are only partially
observable. To formally model such systems, we define an extension of
probabilistic timed automata in which local states are partially visible
to an observer or controller. We give a probabilistic temporal logic that
can express a range of quantitative properties of these models, relating
to the probability of an event’s occurrence or the expected value of a
reward measure. We then propose techniques to either verify that such a
property holds or to synthesise a controller for the model which makes it
true. Our approach is based on an integer discretisation of the model’s
dense-time behaviour and a grid-based abstraction of the uncountable
belief space induced by partial observability. The latter is necessarily
approximate since the underlying problem is undecidable; however, we show
how both lower and upper bounds on numerical results can be generated.
We illustrate the effectiveness of the approach by implementing it in the
PRISM model checker and applying it to several case studies, from the
domains of computer security and task scheduling.


1 Introduction 

Guaranteeing the correctness of complex computerised systems often needs to
take into account quantitative aspects of system behaviour. This includes the
modelling of probabilistic phenomena, such as failure rates for physical
components, uncertainty arising from unreliable sensing of a continuous
environment, or the explicit use of randomisation to break symmetry. It also
includes real-time characteristics, such as time-outs or delays in
communication or security protocols. To further complicate matters, such
systems are often nondeterministic because their behaviour depends on inputs
or instructions from some external entity such as a controller or scheduler.

Automated verification techniques such as probabilistic model checking have
been successfully used to analyse quantitative properties of probabilistic,
real-time systems across a variety of application domains, including wireless
communication protocols, computer security and task scheduling. These systems
are commonly modelled using Markov decision processes (MDPs), if assuming a
discrete notion of time, or probabilistic timed automata (PTAs), if using a dense





Norman, Parker, Zou 


model of time. On these models, we can consider two problems: verification
that the model satisfies some formally specified property for any possible
resolution of nondeterminism; or, dually, synthesis of a controller (i.e., a
means to resolve nondeterminism) under which a property is guaranteed. In
either case, an important consideration is the extent to which the system’s
state is observable to the entity controlling it. For example, to verify that
a security protocol is functioning correctly, it may be essential to model the
fact that some data held by a participant is not externally visible. Similarly,
when synthesising a controller for a robot, the controller may not be
implementable in practice if it bases its decisions on information that cannot
be physically observed.

Partially observable MDPs (POMDPs) are a natural way to extend MDPs in
order to tackle this problem. However, the analysis of POMDPs is considerably
more difficult than that of MDPs since key problems are undecidable [53]. A
variety of verification problems have been studied for these models (see e.g., mm) and
the use of POMDPs is common in fields such as AI and planning [5], but there
has been limited progress in the development of practical techniques for
probabilistic verification in this area, or exploration of their applicability.

In this paper, we present novel techniques for verification and control of
probabilistic real-time systems under partial observability. We propose a model called
partially observable probabilistic timed automata (POPTAs), which extends the
existing model of PTAs with a notion of partial observability. The semantics of
a POPTA is an infinite-state POMDP. We then define a temporal logic, based
on [27], to express properties of POPTAs relating to the probability of an event
or the expected value of various reward measures. Nondeterminism in a POPTA
is resolved by a strategy that decides which actions to take and when to take
them, based only on the history of observations (not states). The core problems
we address are how to verify that a temporal logic property holds for all possible
strategies, and how to synthesise a strategy under which the property holds.

In order to achieve this, we use a combination of techniques. First, we develop
a digital clocks discretisation for POPTAs, which extends the existing notion for
PTAs [50], and reduces the analysis to a finite POMDP. We define the conditions
under which properties in our temporal logic are preserved and prove the
correctness of the reduction. To analyse the resulting POMDP, we use grid-based
techniques [23,29], which transform it to a fully observable but continuous-space
MDP and then approximate its solution based on a finite set of grid points. We
use this to construct and solve a strategy for the POMDP. The result is a pair
of lower and upper bounds on the property of interest for the original POPTA.
If the results are not precise enough, we can refine the grid and repeat.

We implemented these methods in a prototype tool based on PRISM [19],
and investigated their applicability by developing three case studies: a
non-repudiation protocol, a task scheduling problem and a covert channel
prevention device (the NRL pump). Despite the complexity of POMDP solution
methods, we show that useful results can be obtained, often with precise
bounds. In each case study, nondeterminism, probability, real-time behaviour
and partial observability are all crucial ingredients to the analysis, a
combination not supported by any existing techniques or tools.

Related work. POMDPs are common in fields such as AI and planning, and
have many applications [8]. They have also been studied in the verification
community, e.g. mm, establishing undecidability and complexity results for
various qualitative and quantitative verification problems. Work in this area
often also studies related models such as Rabin’s probabilistic automata [11],
which can be seen as a special case of POMDPs, and partially observable
stochastic games (POSGs) [EJ, which generalise them. More practically oriented
work includes: [15], which proposes a counterexample-driven refinement method to
approximately solve MDPs in which components have partial observability of
each other; and da, which synthesises concurrent program constructs, using a
search over memoryless strategies in a POSG. Theoretical results [B. and
algorithms Ena have been developed for synthesis of partially observable timed
games. In [6], it is shown that the synthesis problem is undecidable and, if the
resources of the controller are fixed, decidable but prohibitively expensive. The
algorithms require constraints on controllers: in [9], controllers only respond to
changes made by the environment and, in m, their structure must be fixed in
advance. We are not aware of any work for probabilistic real-time models.

This paper is an extended version, with proofs, of [25].

2 Partially Observable Markov Decision Processes 

We begin with background material on MDPs and POMDPs. Let $Dist(X)$ denote
the set of discrete probability distributions over a set $X$, $\delta_x$ the
distribution that selects $x \in X$ with probability 1, and $\mathbb{R}$ the
set of non-negative real numbers.

Definition 1 (MDP). An MDP is a tuple $\mathsf{M} = (S, \bar{s}, A, P, R)$ where: $S$ is a set
of states; $\bar{s} \in S$ an initial state; $A$ a set of actions; $P : S \times A \to Dist(S)$ a
(partial) probabilistic transition function; and $R : S \times A \to \mathbb{R}$ a reward function.

Each state $s$ of an MDP $\mathsf{M}$ has a set $A(s) \stackrel{\text{def}}{=} \{a \in A \mid P(s,a) \text{ is defined}\}$ of
enabled actions. If action $a \in A(s)$ is selected, then the probability of moving to
state $s'$ is $P(s,a)(s')$ and a reward of $R(s,a)$ is accumulated in doing so. A path
of $\mathsf{M}$ is a finite or infinite sequence $\omega = s_0 a_0 s_1 a_1 \cdots$, where $s_i \in S$, $a_i \in A(s_i)$
and $P(s_i, a_i)(s_{i+1}) > 0$ for all $i \in \mathbb{N}$. We write $FPaths_{\mathsf{M}}$ and $IPaths_{\mathsf{M}}$, respectively,
for the set of all finite and infinite paths of $\mathsf{M}$ starting in the initial state $\bar{s}$.
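
The notions of enabled actions and paths can be made concrete with a small sketch. The Python below is only an illustration, not part of the tool described later: the toy MDP, the dictionary encoding of the partial function P and the helper names `enabled` and `is_path` are all assumptions made for this example.

```python
from fractions import Fraction

# Hypothetical toy MDP: P maps (state, action) -> {successor: probability}.
# A missing key means the action is not enabled there (P is partial).
P = {
    ("s0", "a"): {"s1": Fraction(1, 2), "s2": Fraction(1, 2)},
    ("s0", "b"): {"s0": Fraction(1)},
    ("s1", "a"): {"s1": Fraction(1)},
    ("s2", "b"): {"s2": Fraction(1)},
}
ACTIONS = {"a", "b"}


def enabled(state):
    """A(s) = {a in A | P(s, a) is defined}."""
    return {a for a in ACTIONS if (state, a) in P}


def is_path(states, actions):
    """Check that s0 a0 s1 a1 ... sn is a finite path: every a_i is
    enabled in s_i and P(s_i, a_i)(s_{i+1}) > 0."""
    if len(states) != len(actions) + 1:
        return False
    return all(
        a in enabled(s) and P[(s, a)].get(t, 0) > 0
        for s, a, t in zip(states, actions, states[1:])
    )
```

For instance, `is_path(["s0", "s1", "s1"], ["a", "a"])` holds, whereas `is_path(["s0", "s1"], ["b"])` does not, because action b never moves from s0 to s1.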

A strategy of $\mathsf{M}$ (also called a policy or scheduler) is a way of resolving the
choice of action in each state, based on the MDP’s execution so far.

Definition 2 (Strategy). A strategy of an MDP $\mathsf{M} = (S, \bar{s}, A, P, R)$ is a
function $\sigma : FPaths_{\mathsf{M}} \to Dist(A)$ such that $\sigma(s_0 a_0 s_1 \ldots s_n)(a) > 0$ only if $a \in A(s_n)$.

A strategy is memoryless if its choices only depend on the current state,
finite-memory if it suffices to switch between a finite set of modes, and
deterministic if it always selects an action with probability 1. The set of
strategies of $\mathsf{M}$ is denoted by $\Sigma_{\mathsf{M}}$.

When $\mathsf{M}$ is under the control of a strategy $\sigma$, the resulting behaviour is
captured by a probability measure $Pr^{\sigma}_{\mathsf{M}}$ over the infinite paths of $\mathsf{M}$ [18].

POMDPs. POMDPs extend MDPs by restricting the extent to which their
current state can be observed, in particular by strategies that control them. In
this paper (as in, e.g., mm), we adopt the following notion of observability.

Definition 3 (POMDP). A POMDP is a tuple $\mathsf{M} = (S, \bar{s}, A, P, R, \mathcal{O}, obs)$ where:
$(S, \bar{s}, A, P, R)$ is an MDP; $\mathcal{O}$ is a finite set of observations; and $obs : S \to \mathcal{O}$ is a
labelling of states with observations. For any states $s, s' \in S$ with $obs(s) = obs(s')$,
their enabled actions must be identical, i.e., $A(s) = A(s')$.

The current state $s$ of a POMDP cannot be directly determined, only the
corresponding observation $obs(s) \in \mathcal{O}$. More general notions of observations are
sometimes used, e.g., that depend also on the previous action taken or are probabilistic.
Our analysis of probabilistic verification case studies where partial observation
is needed (see, e.g., Sec. 5) suggests that this simpler notion of observability
will often suffice in practice. To ease presentation, we assume the initial state is
observable, i.e., there exists $\bar{o} \in \mathcal{O}$ such that $obs(s) = \bar{o}$ if and only if $s = \bar{s}$.

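The notions of paths, strategies and probability measures given above for
MDPs transfer directly to POMDPs. However, the set $\Sigma_{\mathsf{M}}$ of all strategies for
a POMDP $\mathsf{M}$ only includes observation-based strategies, that is, strategies $\sigma$
such that, for any paths $\pi = s_0 a_0 s_1 \ldots s_n$ and $\pi' = s'_0 a'_0 s'_1 \ldots s'_n$ satisfying
$obs(s_i) = obs(s'_i)$ and $a_i = a'_i$ for all $i$, we have $\sigma(\pi) = \sigma(\pi')$.

Key properties for a POMDP (or MDP) are the probability of reaching a
target, and the expected reward accumulated until this occurs. Let $O$ denote the
target (e.g., a set of observations of a POMDP). Under a specific strategy $\sigma$, we
denote these two properties by $Pr^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O)$ and $E^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O)$, respectively.

Usually, we are interested in the optimal (minimum or maximum) values
$Pr^{opt}_{\mathsf{M}}(\mathtt{F}\,O)$ and $E^{opt}_{\mathsf{M}}(\mathtt{F}\,O)$, where $opt \in \{\min, \max\}$. For an MDP or POMDP $\mathsf{M}$:

$$Pr^{\min}_{\mathsf{M}}(\mathtt{F}\,O) \stackrel{\text{def}}{=} \inf_{\sigma \in \Sigma_{\mathsf{M}}} Pr^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O) \qquad E^{\min}_{\mathsf{M}}(\mathtt{F}\,O) \stackrel{\text{def}}{=} \inf_{\sigma \in \Sigma_{\mathsf{M}}} E^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O)$$

$$Pr^{\max}_{\mathsf{M}}(\mathtt{F}\,O) \stackrel{\text{def}}{=} \sup_{\sigma \in \Sigma_{\mathsf{M}}} Pr^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O) \qquad E^{\max}_{\mathsf{M}}(\mathtt{F}\,O) \stackrel{\text{def}}{=} \sup_{\sigma \in \Sigma_{\mathsf{M}}} E^{\sigma}_{\mathsf{M}}(\mathtt{F}\,O)$$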

Beliefs. For POMDPs, determining the optimal probabilities and expected
rewards defined above is undecidable [23], making exact solution intractable. A
useful construction, e.g., as a basis of approximate solutions, is the translation
from a POMDP $\mathsf{M}$ to a belief MDP $\mathcal{B}(\mathsf{M})$, an equivalent (fully observable) MDP,
whose (continuous) state space comprises beliefs, which are probability
distributions over the state space of $\mathsf{M}$. Intuitively, although we may not know which of
several observationally-equivalent states we are currently in, we can determine
the likelihood of being in each one, based on the probabilistic behaviour of $\mathsf{M}$.
The formal definition is given below, and we include further details in Appx. B.

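Definition 4 (Belief MDP). Let $\mathsf{M} = (S, \bar{s}, A, P, R, \mathcal{O}, obs)$ be a POMDP. The
belief MDP of $\mathsf{M}$ is given by $\mathcal{B}(\mathsf{M}) = (Dist(S), \delta_{\bar{s}}, A, P^{\mathcal{B}}, R^{\mathcal{B}})$ where, for any beliefs
$b, b' \in Dist(S)$ and action $a \in A$:

$$P^{\mathcal{B}}(b,a)(b') = \sum_{s \in S} b(s) \cdot \Big( \sum_{o \in \mathcal{O} \wedge b^{a,o} = b'} \; \sum_{s' \in S \wedge obs(s') = o} P(s,a)(s') \Big)$$

$$R^{\mathcal{B}}(b,a) = \sum_{s \in S} R(s,a) \cdot b(s)$$

and $b^{a,o}$ is the belief reached from $b$ by performing $a$ and observing $o$, i.e.:

$$b^{a,o}(s') = \begin{cases} \dfrac{\sum_{s \in S} P(s,a)(s') \cdot b(s)}{\sum_{s \in S} b(s) \cdot \big( \sum_{s'' \in S \wedge obs(s'') = o} P(s,a)(s'') \big)} & \text{if } obs(s') = o \\[2ex] 0 & \text{otherwise.} \end{cases}$$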
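The optimal values for the belief MDP equal those for the POMDP, e.g. we have:

$$Pr^{\max}_{\mathsf{M}}(\mathtt{F}\,O) = Pr^{\max}_{\mathcal{B}(\mathsf{M})}(\mathtt{F}\,T_O) \quad \text{and} \quad E^{\max}_{\mathsf{M}}(\mathtt{F}\,O) = E^{\max}_{\mathcal{B}(\mathsf{M})}(\mathtt{F}\,T_O)$$

where $T_O = \{b \in Dist(S) \mid \forall s \in S.\, (b(s) > 0 \to obs(s) \in O)\}$.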
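
The belief reached by performing an action and then seeing an observation is the standard Bayesian update. The following Python sketch illustrates it under assumed encodings: the dictionary layout of `P` and `obs`, the function name `belief_update` and the toy POMDP fragment are hypothetical and chosen only for this example.

```python
from fractions import Fraction


def belief_update(b, a, o, P, obs):
    """Return (probability of observing o after a, updated belief b^{a,o}).

    b   : dict state -> probability (the current belief)
    P   : dict (state, action) -> dict successor -> probability
    obs : dict state -> observation
    The belief is None when o cannot be observed after taking a in b.
    """
    weights = {}
    for s, p_s in b.items():
        for t, p_t in P[(s, a)].items():
            if obs[t] == o:  # keep only successors consistent with o
                weights[t] = weights.get(t, 0) + p_s * p_t
    total = sum(weights.values())  # normalising constant
    if total == 0:
        return total, None
    return total, {t: w / total for t, w in weights.items()}


# Hypothetical POMDP fragment: s1 and s2 share observation "h".
P = {("s0", "a"): {"s1": Fraction(1, 4), "s2": Fraction(1, 4), "s3": Fraction(1, 2)}}
OBS = {"s0": "init", "s1": "h", "s2": "h", "s3": "g"}
```

Starting from the belief that puts probability 1 on s0, taking a and observing "h" happens with probability 1/2 and yields a belief that is uniform over s1 and s2.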

3 Partially Observable Probabilistic Timed Automata 

In this section, we define partially observable probabilistic timed automata
(POPTAs), which generalise the existing model of probabilistic timed automata (PTAs)
with the notion of partial observability from POMDPs explained in Sec. 2. We
define the syntax of a POPTA, explain its semantics (as an infinite-state POMDP)
and define and discuss the digital clocks semantics of a POPTA.

Time & clocks. As in classical timed automata [2], we model real-time
behaviour using non-negative, real-valued variables called clocks, whose values
increase at the same rate as real time. Assuming a finite set of clocks $\mathcal{X}$, a clock
valuation $v$ is a function $v : \mathcal{X} \to \mathbb{R}$ and we write $\mathbb{R}^{\mathcal{X}}$ for the set of all clock
valuations. Clock valuations obtained from $v$ by incrementing all clocks by a
delay $t \in \mathbb{R}$ and by resetting a set $X \subseteq \mathcal{X}$ of clocks to zero are denoted $v+t$ and
$v[X:=0]$, respectively, and we write $\mathbf{0}$ if all clocks are 0. A (closed, diagonal-free)
clock constraint $\zeta$ is either a conjunction of inequalities of the form $x \le c$ or $x \ge c$,
where $x \in \mathcal{X}$ and $c \in \mathbb{N}$, or true. We write $v \models \zeta$ if clock valuation $v$ satisfies
clock constraint $\zeta$ and use $CC(\mathcal{X})$ for the set of all clock constraints over $\mathcal{X}$.
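
The operations on clock valuations are simple enough to sketch directly. In the Python below, the dictionary encoding of valuations, the triple encoding of constraints and the function names are assumptions made for illustration only.

```python
def delay(v, t):
    """v + t: advance every clock by the same delay t."""
    return {x: val + t for x, val in v.items()}


def reset(v, X):
    """v[X := 0]: set the clocks in X to zero and leave the others unchanged."""
    return {x: (0 if x in X else val) for x, val in v.items()}


def satisfies(v, constraint):
    """Check v |= constraint for a closed, diagonal-free constraint.

    constraint is a list of (clock, op, c) triples with op in {"<=", ">="}
    and c a natural number; the empty list encodes `true`.
    """
    return all(
        v[x] <= c if op == "<=" else v[x] >= c
        for x, op, c in constraint
    )
```

For example, `reset(delay({"x": 0, "y": 0}, 2), {"x"})` gives the valuation with x at 0 and y at 2, which satisfies the constraint `[("x", "<=", 1), ("y", ">=", 2)]`.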

Syntax of POPTAs. To explain the syntax of POPTAs, we first consider the 
simpler model of PTAs and then show how it extends to POPTAs. 

Definition 5 (PTA syntax). A PTA is a tuple $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, inv, enab, prob, r)$
where:

— $L$ is a finite set of locations and $\bar{l} \in L$ is an initial location;

— $\mathcal{X}$ is a finite set of clocks and $A$ is a finite set of actions;

— $inv : L \to CC(\mathcal{X})$ is an invariant condition;

— $enab : L \times A \to CC(\mathcal{X})$ is an enabling condition;

— $prob : L \times A \to Dist(2^{\mathcal{X}} \times L)$ is a probabilistic transition function;

— $r = (r_L, r_A)$ is a reward structure where $r_L : L \to \mathbb{R}$ is a location reward
function and $r_A : L \times A \to \mathbb{R}$ is an action reward function.

A state of a PTA is a pair $(l,v)$ of location $l \in L$ and clock valuation $v \in
\mathbb{R}^{\mathcal{X}}$. Time $t \in \mathbb{R}$ can elapse in the state only if the invariant $inv(l)$ remains
continuously satisfied while time passes and the new state is then $(l,v+t)$. An
action $a$ is enabled in the state if $v$ satisfies $enab(l,a)$ and, if it is taken, then
the PTA moves to location $l'$ and resets the clocks $X \subseteq \mathcal{X}$ with probability
$prob(l,a)(X,l')$. PTAs have two kinds of rewards: location rewards, which are
accumulated at rate $r_L(l)$ while in location $l$, and action rewards $r_A(l,a)$, which
are accumulated when taking action $a$ in location $l$. PTAs equipped with reward
structures are a probabilistic extension of linearly-priced timed automata [5].

Definition 6 (POPTA syntax). A partially observable PTA (POPTA) is a
tuple $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, inv, enab, prob, r, \mathcal{O}_L, obs_L)$ where:

— $(L, \bar{l}, \mathcal{X}, A, inv, enab, prob, r)$ is a PTA;

— $\mathcal{O}_L$ is a finite set of observations;

— $obs_L : L \to \mathcal{O}_L$ is a location observation function.

For any locations $l, l' \in L$ with $obs_L(l) = obs_L(l')$, we require that $inv(l) = inv(l')$
and $enab(l,a) = enab(l',a)$ for all $a \in A$.

The final condition ensures the semantics of a POPTA yields a valid POMDP:
recall states with the same observation are required to have identical available
actions. As for POMDPs, for simplicity, we also assume that the initial location
is observable, i.e., there exists $\bar{o} \in \mathcal{O}_L$ such that $obs_L(l) = \bar{o}$ if and only if $l = \bar{l}$.

The notion of observability for POPTAs is similar to the one for POMDPs,
but applied to locations. Clocks, on the other hand, are always observable. The
requirement that the same choices must be available in any
observationally-equivalent states implies that the same delays must be available in
observationally-equivalent states, and so unobservable clocks could not feature
in invariant or enabling conditions. The inclusion of unobservable clocks would
therefore necessitate modelling the system as a game with the elapse of time
being under the control of a second (environment) player. The underlying
semantic model would then be a partially observable stochastic game (POSG),
rather than a POMDP. However, unlike POMDPs, limited progress has been made on
efficient computational techniques for this model (belief space based
techniques, for example, do not apply in general im). Even in the simpler case
of non-probabilistic timed games, allowing unobservable clocks requires
algorithmic analysis to restrict the class of strategies considered wm.

Encouragingly, however, we will later show in Sec. 5 that POPTAs with
observable clocks were always sufficient for our modelling and analysis.

Restrictions on POPTAs. At this point, we need to highlight a few syntactic
restrictions on the POPTAs treated in this paper. Firstly, we emphasise that
clock constraints appearing in a POPTA, i.e., in its invariants and enabling
conditions, are required to be closed (no strict inequalities) and diagonal-free
(no comparisons of clocks). This is a standard restriction when using the digital
clocks discretisation m which we work with in this paper.

Secondly, a specific (but minor) restriction for POPTAs is that resets can only
be applied to clocks that are non-zero. The reasoning behind this is outlined later
in Example 2. Checking this restriction can easily be done when exploring the
discrete (digital clocks) semantics of the model (see below and Sec. 4).

Fig. 1. Examples of partially observable PTAs (see Examples 1 and 2).

Semantics of POPTAs. We now formally define the semantics of a POPTA $\mathsf{P}$,
which is given in terms of an infinite-state POMDP. This extends the standard
semantics of a PTA [27] (as an infinite MDP) with the same notion of
observability we gave in Sec. 2 for POMDPs. The semantics, $[\![\mathsf{P}]\!]_{\mathbb{T}}$, is parameterised by
a time domain $\mathbb{T}$, giving the possible values taken by clocks. For the standard
(dense-time) semantics of a POPTA, we take $\mathbb{T} = \mathbb{R}$. Later, when we discretise
the model, we will re-use this definition, taking $\mathbb{T} = \mathbb{N}$. When referring to the
“standard” semantics of $\mathsf{P}$ we will often drop the subscript $\mathbb{R}$ and write $[\![\mathsf{P}]\!]$.

Definition 7 (POPTA semantics). Let $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, inv, enab, prob, r, \mathcal{O}_L,
obs_L)$ be a POPTA. The semantics of $\mathsf{P}$, with respect to the time domain $\mathbb{T}$, is
the POMDP $[\![\mathsf{P}]\!]_{\mathbb{T}} = (S, \bar{s}, A \cup \mathbb{T}, P, R, \mathcal{O}_L \times \mathbb{T}^{\mathcal{X}}, obs)$ such that:

— $S = \{(l,v) \in L \times \mathbb{T}^{\mathcal{X}} \mid v \models inv(l)\}$ and $\bar{s} = (\bar{l}, \mathbf{0})$;

— for $(l,v) \in S$ and $a \in A \cup \mathbb{T}$, we have $P((l,v),a) = \mu$ if and only if:

  • (time transitions) $a \in \mathbb{T}$, $\mu = \delta_{(l,v+a)}$ and $v+t \models inv(l)$ for all $0 \le t \le a$;

  • (action transition) $a \in A$, $v \models enab(l,a)$ and for $(l',v') \in S$:

    $$\mu(l',v') = \sum_{X \subseteq \mathcal{X} \wedge v' = v[X:=0]} prob(l,a)(X,l')$$

— for any $(l,v) \in S$ and $a \in A \cup \mathbb{T}$, we have

    $$R((l,v),a) = \begin{cases} r_L(l) \cdot a & \text{if } a \in \mathbb{T} \\ r_A(l,a) & \text{if } a \in A \end{cases}$$

— for any $(l,v) \in S$, we have $obs(l,v) = (obs_L(l), v)$.

Example 1. Consider the POPTA in Fig. 1(a) with clocks $x$, $y$. Locations are
grouped according to their observations, and we omit enabling conditions equal
to true. We aim to maximise the probability of observing $o_5$. If locations were
fully observable, we would leave $\bar{l}$ when $x=y=1$ and then, depending on whether
the random choice resulted in a transition to $l_1$ or $l_2$, wait 0 or 1 time units,
respectively, before leaving the location. This would allow us to move immediately
from $l_3$ or $l_4$ to $l_5$, meaning observation $o_5$ is seen with probability 1. However,
in the POPTA, we need to make the same choice in $l_1$ and $l_2$ since they yield the
same observation. As a result, at most one of the transitions leaving locations $l_3$
and $l_4$ is enabled, and the probability of observing $o_5$ is thus at most 0.5.
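
The action-transition rule of Defn. 7 sums the probabilities of all reset sets that lead to the same successor state. A minimal Python sketch of that rule follows. It assumes a hypothetical dictionary encoding of prob whose keys are pairs of a frozen reset set and a target location, and it deliberately omits the enabledness and invariant checks.

```python
from fractions import Fraction


def action_distribution(l, v, a, prob):
    """Return mu with mu[(l', v')] equal to the sum of prob(l, a)(X, l')
    over all reset sets X such that v' = v[X := 0].

    v is a dict clock -> value. Successor valuations are returned as
    sorted tuples of (clock, value) pairs so they can serve as dict keys.
    """
    mu = {}
    for (X, l_next), p in prob[(l, a)].items():
        v_next = tuple(sorted((x, 0 if x in X else val) for x, val in v.items()))
        mu[(l_next, v_next)] = mu.get((l_next, v_next), 0) + p
    return mu


# Hypothetical transition: with probability 1/2 reset x, otherwise reset nothing.
PROB = {
    ("l0", "a"): {
        (frozenset({"x"}), "l1"): Fraction(1, 2),
        (frozenset(), "l1"): Fraction(1, 2),
    }
}
```

When x equals 1 the two reset sets give two distinct successor states, each with probability 1/2. When x is already 0 they collapse into a single successor with probability 1, so an observer can no longer tell from the clock value whether the reset happened.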

Digital clocks. Since the semantics of a POPTA (like for a PTA) is an
infinite-state model, for algorithmic analysis, we first need to construct a finite
representation. In this paper, we propose to use the digital clocks approach, generalising
a technique already used for PTAs [20], which in turn adapts one for timed
automata m. In short, this approach discretises a POPTA model by transforming
its real-valued clocks to clocks taking values from a bounded set of integers.

For clock $x \in \mathcal{X}$, let $k_x$ denote the greatest constant to which $x$ is compared
in the clock constraints of POPTA $\mathsf{P}$. If the value of $x$ exceeds $k_x$, its exact value
will not affect the satisfaction of any invariants or enabling conditions, and thus
not affect the behaviour of $\mathsf{P}$. The digital clocks semantics, written $[\![\mathsf{P}]\!]_{\mathbb{N}}$, can be
obtained from Defn. 7, taking $\mathbb{T}$ to be $\mathbb{N}$ instead of $\mathbb{R}$. We also need to redefine
the operation $v+t$, which now adds a delay $t \in \mathbb{N}$ to a clock valuation $v \in \mathbb{N}^{\mathcal{X}}$:
we say that $v+t$ assigns the value $\min\{v(x)+t, k_x+1\}$ to each clock $x \in \mathcal{X}$.
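
The capped delay is what keeps the digital clocks state space finite. The Python sketch below illustrates it. The dictionary `k` of per-clock maximal constants and both function names are assumptions for this example. The count is only an upper bound on the clock valuations that can occur, since invariants and reachability may exclude some of them.

```python
from math import prod


def digital_delay(v, t, k):
    """v + t in the digital clocks semantics: clock x is capped at k[x] + 1."""
    return {x: min(val + t, k[x] + 1) for x, val in v.items()}


def max_digital_valuations(k):
    """Upper bound on the number of digital clock valuations: each clock x
    ranges over {0, ..., k[x] + 1}, i.e. k[x] + 2 values."""
    return prod(kx + 2 for kx in k.values())
```

With `k = {"x": 1, "y": 5}`, delaying `{"x": 0, "y": 3}` by 2 gives `{"x": 2, "y": 5}`, and any longer delay leaves x at its cap of 2.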

Under the restrictions on POPTAs described above, the digital semantics of
a POPTA preserves the key properties required in this paper, namely optimal
probabilities and expected cumulative rewards for reaching a specified
observation. This is captured by the following theorem, which we prove in Appx. A.


Theorem 1. If $\mathsf{P}$ is a closed, diagonal-free POPTA which resets only non-zero
clocks, then, for any set of observations $O$ of $\mathsf{P}$ and $opt \in \{\min, \max\}$, we have:

$$Pr^{opt}_{[\![\mathsf{P}]\!]_{\mathbb{R}}}(\mathtt{F}\,O) = Pr^{opt}_{[\![\mathsf{P}]\!]_{\mathbb{N}}}(\mathtt{F}\,O) \quad \text{and} \quad E^{opt}_{[\![\mathsf{P}]\!]_{\mathbb{R}}}(\mathtt{F}\,O) = E^{opt}_{[\![\mathsf{P}]\!]_{\mathbb{N}}}(\mathtt{F}\,O).$$

The proof relies on showing probabilistic and expected reward values agree on the
belief MDPs underlying the POMDPs representing the dense time and digital
clocks semantics. This requires introducing the concept of a belief PTA for a
POPTA (analogous to a belief MDP for a POMDP) and results for PTAs [20].

Example 2. The POPTA $\mathsf{P}$ in Fig. 1(b) demonstrates why our digital clocks
approach (Thm. 1) is restricted to POPTAs which reset only non-zero clocks. We
aim to minimise the expected reward accumulated before observing $o_3$ (rewards
are shown in Fig. 1(b) and are zero if omitted). If locations were fully observable,
the minimum reward would be 0, achieved by leaving $\bar{l}$ immediately and then
choosing $a_1$ in $l_1$ and $a_2$ in $l_2$. However, if we leave $\bar{l}$ immediately, $l_1$ and $l_2$
are indistinguishable (we observe $(o_{1,2}, (0))$ when arriving in both), so we must
choose the same action in these locations, and hence the expected reward is 0.5.

Consider the strategy that waits $\varepsilon \in (0,1)$ before leaving $\bar{l}$, accumulating a
reward of $\varepsilon$. This is possible only in the dense-time semantics. We then observe
either $(o_{1,2}, (\varepsilon))$ in $l_1$, or $(o_{1,2}, (0))$ in $l_2$. Thus, we see if $x$ was reset, determine if
we are in $l_1$ or $l_2$, and take action $a_1$ or $a_2$ accordingly such that no further reward
is accumulated before seeing $o_3$, for a total reward of $\varepsilon$. Since $\varepsilon$ can be arbitrarily
small, the minimum (infimum) expected reward for $[\![\mathsf{P}]\!]_{\mathbb{R}}$ is 0. However, for the
digital clocks semantics, we can only choose a delay of 0 or 1 in $\bar{l}$. For the former,
the expected reward is 0.5, as described above; for the latter, we can again pick
$a_1$ or $a_2$ based on whether $x$ was reset, for a total expected reward of 1. Hence
the minimum expected reward for $[\![\mathsf{P}]\!]_{\mathbb{N}}$ is 0.5, as opposed to 0 for $[\![\mathsf{P}]\!]_{\mathbb{R}}$.

4 Verification and Strategy Synthesis for POPTAs

We now present our approach for verification and strategy synthesis for POPTAs
using the digital clocks semantics given in the previous section.

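Property specification. First, we define a temporal logic for the formal
specification of quantitative properties of POPTAs. This is based on a subset (we
omit temporal operator nesting) of the logic presented in [27] for PTAs.

Definition 8 (Properties). The syntax of our logic is given by the grammar:

$$\phi ::= \mathtt{P}_{\bowtie p}[\psi] \mid \mathtt{R}_{\bowtie q}[\rho] \qquad \psi ::= \alpha\,\mathtt{U}^{\le t}\,\alpha \mid \alpha\,\mathtt{U}\,\alpha$$

$$\alpha ::= true \mid o \mid \neg o \mid \zeta \mid \alpha \wedge \alpha \mid \alpha \vee \alpha \qquad \rho ::= \mathtt{I}^{=t} \mid \mathtt{C}^{\le t} \mid \mathtt{F}\,\alpha$$

where $o$ is an observation, $\zeta$ is a clock constraint, $\bowtie \in \{\le, <, \ge, >\}$, $p \in \mathbb{Q} \cap [0,1]$,
$q \in \mathbb{Q}_{\ge 0}$ and $t \in \mathbb{N}$.

A property $\phi$ is an instance of either the probabilistic operator $\mathtt{P}$ or the expected
reward operator $\mathtt{R}$. As for similar logics, $\mathtt{P}_{\bowtie p}[\psi]$ means the probability of path
formula $\psi$ being satisfied is $\bowtie p$, and $\mathtt{R}_{\bowtie q}[\rho]$ the expected value of reward operator
$\rho$ is $\bowtie q$. For the probabilistic operator, we allow time-bounded ($\alpha\,\mathtt{U}^{\le t}\,\alpha$) and
unbounded ($\alpha\,\mathtt{U}\,\alpha$) until formulas, and adopt the usual equivalences such as
$\mathtt{F}\,\alpha \equiv true\,\mathtt{U}\,\alpha$ (“eventually $\alpha$”). For the reward operator, we allow $\mathtt{I}^{=t}$ (location
reward at time instant $t$), $\mathtt{C}^{\le t}$ (reward accumulated until time $t$) and $\mathtt{F}\,\alpha$ (the
reward accumulated until $\alpha$ becomes true). Our propositional formulas ($\alpha$) are
Boolean combinations of observations and clock constraints.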

We omit nesting of $\mathtt{P}$ and $\mathtt{R}$ operators for two reasons: firstly, the digital clocks
approach that we used to discretise time is not applicable to nested properties
(see PD] for details); and secondly, it allows us to use a consistent property
specification for either verification or strategy synthesis problems (the latter is
considerably more difficult in the context of nested formulas 0).

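Definition 9 (Property semantics). Let $\mathsf{P}$ be a POPTA with location
observation function $obs_L$ and semantics $[\![\mathsf{P}]\!]$. We define satisfaction of a property $\phi$
from Defn. 8 with respect to a strategy $\sigma \in \Sigma_{[\![\mathsf{P}]\!]}$ as follows:

$$[\![\mathsf{P}]\!], \sigma \models \mathtt{P}_{\bowtie p}[\psi] \iff Pr^{\sigma}_{[\![\mathsf{P}]\!]}(\{\omega \in IPaths_{[\![\mathsf{P}]\!]} \mid \omega \models \psi\}) \bowtie p$$

$$[\![\mathsf{P}]\!], \sigma \models \mathtt{R}_{\bowtie q}[\rho] \iff E^{\sigma}_{[\![\mathsf{P}]\!]}(rew(\rho)) \bowtie q$$

Satisfaction of a path formula $\psi$ by path $\omega$, denoted $\omega \models \psi$, and the random
variable $rew(\rho)$ for a reward operator $\rho$ are defined identically as for PTAs. Due
to lack of space, we omit their formal definition here and refer the reader to [27].
For a propositional formula $\alpha$ and state $s = (l,v)$ of $[\![\mathsf{P}]\!]$, we have $s \models o$ if and
only if $obs_L(l) = o$ and $s \models \zeta$ if and only if $v \models \zeta$. Boolean operators are standard.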

Verification and strategy synthesis. Given a POPTA $\mathsf{P}$ and property $\phi$, we
are interested in solving the dual problems of verification and strategy synthesis.

Definition 10 (Verification). The verification problem is: given a POPTA $\mathsf{P}$
and property $\phi$, decide if $[\![\mathsf{P}]\!], \sigma \models \phi$ holds for all strategies $\sigma \in \Sigma_{[\![\mathsf{P}]\!]}$.

Definition 11 (Strategy synthesis). The strategy synthesis problem is: given
POPTA $\mathsf{P}$ and property $\phi$, find, if it exists, a strategy $\sigma \in \Sigma_{[\![\mathsf{P}]\!]}$ such that $[\![\mathsf{P}]\!], \sigma \models \phi$.

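The verification and strategy synthesis problems for $\phi$ can be solved similarly, by
computing optimal values for either probability or expected reward objectives:

$$Pr^{\min}_{[\![\mathsf{P}]\!]}(\psi) = \inf_{\sigma \in \Sigma_{[\![\mathsf{P}]\!]}} Pr^{\sigma}_{[\![\mathsf{P}]\!]}(\psi) \qquad E^{\min}_{[\![\mathsf{P}]\!]}(\rho) = \inf_{\sigma \in \Sigma_{[\![\mathsf{P}]\!]}} E^{\sigma}_{[\![\mathsf{P}]\!]}(\rho)$$

$$Pr^{\max}_{[\![\mathsf{P}]\!]}(\psi) = \sup_{\sigma \in \Sigma_{[\![\mathsf{P}]\!]}} Pr^{\sigma}_{[\![\mathsf{P}]\!]}(\psi) \qquad E^{\max}_{[\![\mathsf{P}]\!]}(\rho) = \sup_{\sigma \in \Sigma_{[\![\mathsf{P}]\!]}} E^{\sigma}_{[\![\mathsf{P}]\!]}(\rho)$$

and, where required, also synthesising an optimal strategy. For example, verifying
$\phi = \mathtt{P}_{\ge p}[\psi]$ requires computation of $Pr^{\min}_{[\![\mathsf{P}]\!]}(\psi)$ since $\phi$ is satisfied by all strategies
if and only if $Pr^{\min}_{[\![\mathsf{P}]\!]}(\psi) \ge p$. Dually, consider synthesising a strategy for which
$\phi' = \mathtt{P}_{<p}[\psi]$ holds. Such a strategy exists if and only if $Pr^{\min}_{[\![\mathsf{P}]\!]}(\psi) < p$ and, if it does,
we can use the optimal strategy that achieves the minimum value. A common
practice in probabilistic verification is to simply query the optimal values directly,
using numerical properties such as $\mathtt{P}_{\min=?}[\psi]$ and $\mathtt{R}_{\max=?}[\rho]$.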

As mentioned earlier, when solving POPTAs (or POMDPs), we may only be
able to under- and over-approximate optimal values, which requires adapting the
processes sketched above. For example, if we have determined lower and upper
bounds $p^{\flat} \le Pr^{\min}_{[\![\mathsf{P}]\!]}(\psi) \le p^{\sharp}$, we can verify that $\phi = \mathtt{P}_{\ge p}[\psi]$ holds if $p^{\flat} \ge p$ or
ascertain that $\phi$ does not hold if $p > p^{\sharp}$. But, if $p^{\flat} < p \le p^{\sharp}$, we need to refine our
approximation to produce tighter bounds. An analogous process can be followed
for the case of strategy synthesis. The remainder of this section therefore focuses
on how to (approximately) compute optimal values and strategies for POPTAs.
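
The three possible outcomes of verifying a lower-bounded probabilistic property from a pair of bounds can be summarised in a few lines. The Python sketch below is only an illustration: the function name and the use of `None` for an inconclusive result are assumptions, not part of the tool.

```python
def check_at_least(p, lower, upper):
    """Three-valued check of a property requiring probability >= p under all
    strategies, given bounds lower <= Pr_min <= upper on the minimum probability.

    Returns True if the property is verified, False if it is refuted, and
    None if the bounds are too loose to decide and must be refined.
    """
    if lower >= p:
        return True   # even the lower bound meets the threshold
    if upper < p:
        return False  # even the upper bound misses the threshold
    return None       # lower < p <= upper: inconclusive
```

For instance, with bounds 0.42 and 0.47 the threshold 0.4 is verified, the threshold 0.5 is refuted, and the threshold 0.45 cannot be decided without tighter bounds.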

Numerical computation algorithms. Approximate numerical computation
of either optimal probabilities or expected reward values on a POPTA $\mathsf{P}$ is
performed with the sequence of steps given below, each of which is described in more
detail subsequently. We compute both an under- and an over-approximation. For
the former, we also generate a strategy which achieves this value.

(A) We modify POPTA $\mathsf{P}$, reducing the problem to computing optimal values
for a probabilistic reachability or expected cumulative reward property [27];

(B) We apply the digital clocks discretisation of Sec. 3 to reduce the infinite-state
semantics $[\![\mathsf{P}]\!]_{\mathbb{R}}$ of $\mathsf{P}$ to a finite-state POMDP $[\![\mathsf{P}]\!]_{\mathbb{N}}$;

(C) We build and solve a finite abstraction of the (infinite-state) belief MDP
$\mathcal{B}([\![\mathsf{P}]\!]_{\mathbb{N}})$ of the POMDP from (B), yielding an over-approximation;

(D) We synthesise and analyse a strategy for $[\![\mathsf{P}]\!]_{\mathbb{N}}$, giving an under-approximation;

(E) If required, we refine the abstraction’s precision and repeat (C) and (D).

(A) Property reduction. As discussed in [27] (for PTAs), checking $\mathtt{P}$ or $\mathtt{R}$
properties of the logic of Defn. 8 can always be reduced to checking either a
probabilistic reachability ($\mathtt{P}_{\bowtie p}[\mathtt{F}\,\alpha]$) or expected cumulative reward ($\mathtt{R}_{\bowtie q}[\mathtt{F}\,\alpha]$)
property on a modified model. For example, time-bounded probabilistic reachability
($\mathtt{P}_{\bowtie p}[\mathtt{F}^{\le t}\,\alpha]$) can be transformed into probabilistic reachability ($\mathtt{P}_{\bowtie p}[\mathtt{F}\,(\alpha \wedge y \le t)]$)
where $y$ is a new clock added to the model. We refer to [27] for full details.

(B) Digital clocks. We showed in Sec. 3 that, assuming certain simple
restrictions on the POPTA $\mathsf{P}$, we can construct a finite POMDP $[\![\mathsf{P}]\!]_{\mathbb{N}}$ representing $\mathsf{P}$ by
treating clocks as bounded integer variables. The translation itself is relatively
straightforward, involving a syntactic translation of the PTA (to convert clocks),
followed by a systematic exploration of its finite state space. At this point, we
also check satisfaction of the restrictions on POPTAs described in Sec. 3.

(C) Over-approximation. We now solve the finite POMDP $[\![\mathsf{P}]\!]_{\mathbb{N}}$. For
simplicity, here and below, we describe the case of maximum reachability probabilities
(the other cases are very similar) and thus need to compute $Pr^{\max}_{[\![\mathsf{P}]\!]_{\mathbb{N}}}(\mathtt{F}\,O)$. We first
compute an over-approximation, i.e. an upper bound on the maximum
probability. This is computed from an approximate solution to the belief MDP $\mathcal{B}([\![\mathsf{P}]\!]_{\mathbb{N}})$,
whose construction we outlined in Sec. 2. This MDP has a continuous state
space: the set of beliefs $Dist(S)$, where $S$ is the state space of $[\![\mathsf{P}]\!]_{\mathbb{N}}$.

To approximate its solution, we adopt the approach of [59] which computes
values for a finite set of representative beliefs $G$ whose convex hull is $Dist(S)$.
Value iteration is applied to the belief MDP, using the computed values for beliefs
in $G$ and interpolating to get values for those not in $G$. The resulting values give
the required upper bound. We use [55] as it works with unbounded (infinite
horizon) and undiscounted properties. There are many other similar approaches [58],
but these are formulated for discounted or finite-horizon properties.

The representative beliefs can be chosen in a variety of ways. We follow [53],
where $G = \{\frac{1}{M} v \mid v \in \mathbb{N}^{|S|} \wedge \sum_{i=1}^{|S|} v(i) = M\}$, i.e. a uniform grid with resolution
$M$. A benefit is that interpolation is very efficient, using a process called
triangulation m. A downside is that the grid size is exponential in $M$.
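
To make the uniform grid concrete, the following Python sketch enumerates its points for a given number of states and resolution. It is an illustration only: the function name and the recursive enumeration are choices made for this example and say nothing about how the tool stores or interpolates over the grid.

```python
from fractions import Fraction
from math import comb


def uniform_grid(n, M):
    """All beliefs v / M over n states, where v is a vector of n natural
    numbers summing to M (assumes n >= 1 and M >= 1)."""

    def compositions(parts, total):
        # Enumerate all tuples of `parts` naturals that sum to `total`.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(parts - 1, total - first):
                yield (first,) + rest

    return [tuple(Fraction(c, M) for c in v) for v in compositions(n, M)]
```

For three states and resolution 2 this yields six grid points, matching the count `comb(n + M - 1, n - 1)`, and each point is a probability distribution whose entries are multiples of 1/2.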

(D) Under-approximation. Since it is preferable to have two-sided bounds,
we also compute an under-approximation: here, a lower bound on
$\mathit{Pr}^{\max}_{[\mathsf{P}]_{\mathbb{N}}}(\mathtt{F}\ O)$. To do so, we first synthesise a finite-memory strategy
$\sigma^*$ for $[\mathsf{P}]_{\mathbb{N}}$ (which is often a required output anyway). The choices of
this strategy are built by stepping through the belief MDP and, for the current
belief, choosing an action that achieves the values returned by value iteration
in (C) above; see for example [55]. We then compute, by building and solving the
finite Markov chain induced by $[\mathsf{P}]_{\mathbb{N}}$ and $\sigma^*$, the value
$\mathit{Pr}^{\sigma^*}_{[\mathsf{P}]_{\mathbb{N}}}(\mathtt{F}\ O)$, which is a lower bound for $\mathit{Pr}^{\max}_{[\mathsf{P}]_{\mathbb{N}}}(\mathtt{F}\ O)$.
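
The last step, solving the induced finite Markov chain, can be sketched as follows. This is a minimal illustration assuming the chain is given as a dictionary of successor distributions; the name `reach_prob` and the simple fixed-point iteration are assumptions for this example and not necessarily what the tool does internally. Iterating from zero approaches the reachability probability from below.

```python
def reach_prob(chain, targets, init, tol=1e-12, max_iters=100_000):
    """Approximate the probability of reaching `targets` from `init`.

    chain: dict mapping every state to a dict {successor: probability}.
    targets: set of target states.
    Illustrative sketch (hypothetical name and data layout).
    """
    # Start from 0 on non-target states, so values increase towards the
    # least fixed point of the reachability equations.
    x = {s: (1.0 if s in targets else 0.0) for s in chain}
    for _ in range(max_iters):
        delta = 0.0
        for s, succ in chain.items():
            if s in targets:
                continue
            new = sum(p * x[t] for t, p in succ.items())
            delta = max(delta, abs(new - x[s]))
            x[s] = new
        if delta < tol:
            break
    return x[init]


if __name__ == "__main__":
    # State 0 loops with probability 0.5, else moves to target 1 or sink 2.
    chain = {0: {0: 0.5, 1: 0.25, 2: 0.25}, 1: {1: 1.0}, 2: {2: 1.0}}
    print(reach_prob(chain, {1}, 0))
```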

(E) Refinement. Finally, although no a priori bound can be given on the error
between the generated under- and over-approximations (recall that the basic
problem is undecidable), asymptotic convergence of the grid-based approach is
guaranteed [55]. In practice, if the computed approximations do not suffice to
verify the required property (or, for strategy synthesis, $\sigma^*$ does not satisfy
the property), then we increase the grid resolution M and repeat steps (C) and (D).

5 Implementation and Case Studies

We have built a prototype tool for verification and strategy synthesis of POPTAs
and POMDPs as an extension of PRISM. We extended the existing modelling
language for PTAs to allow model variables to be specified as observable
or hidden. The tool performs the steps outlined in Sec. 4, computing a pair of
bounds for a given property and synthesising a corresponding strategy. We
focus on POPTAs, but the tool can also analyse POMDPs directly. The software,
details of all case studies, parameters and properties are available online at:

http://www.prismmodelchecker.org/files/formats15poptas/

We have developed three case studies to evaluate the tool and techniques,
discussed in more detail below. In each case, nondeterminism, probability, real-time
behaviour and partial observability are all essential aspects required for analysis.

The NRL pump. The NRL (Naval Research Laboratory) pump [17] is designed
to provide reliable and secure communication over networks of nodes with ‘high’
and ‘low’ security levels. It prevents a covert channel leaking information from
‘high’ to ‘low’ through the timing of messages and acknowledgements.
Communication is buffered and probabilistic delays are added to acknowledgements
from ‘high’ in such a way that the potential for information leakage is minimised,
while maintaining network performance. A PTA model is considered in [21].

We model the pump as a POPTA using a hidden variable for a secret value
$z \in \{0,1\}$ (initially set uniformly at random) which ‘high’ tries to covertly
communicate to ‘low’. This communication is attempted by adding a delay of $h_0$ or
$h_1$, depending on the value of z, whenever sending an acknowledgement to ‘low’.
In the model, ‘low’ sends N messages to ‘high’ and tries to guess z based on the
time taken for its messages to be acknowledged. We consider the maximum
probability that ‘low’ can (either eventually or within some time frame) correctly
guess z. We also study the expected time to send all messages and acknowledgements.
These properties measure the security and performance aspects of the pump.
Results are presented in Fig. 2, varying $h_1$ and N (we fix $h_0 = 2$). They show that
increasing either the difference between $h_0$ and $h_1$ (i.e., increasing $h_1$) or the
number N of messages sent improves the chance of ‘low’ correctly guessing the
secret z, at the cost of a decrease in network performance. On the other hand,
when $h_0 = h_1$, however many messages are sent, ‘low’, as expected, learns nothing
of the value being sent and at best can guess correctly with probability 0.5.

Fig. 2. Analysing security/performance of the NRL pump: (a) Maximum probability
of covert channel success; (b) Maximum expected transmission time.

Task-graph scheduler. Secondly, we consider a task-graph scheduling problem
adapted from [7], where the goal is to minimise the time or energy consumption
required to evaluate an arithmetic expression on multiple processors with
different speeds and energy consumption. We extend both the basic model of [7] and
its extension that uses PTAs to model probabilistic task execution times. We add
a new ‘low power’ state to one processor, allowing it to save energy when
not in use, but which incurs a delay when waking up to execute a new task.
This state is entered with probability sleep after each task is completed. We
assume that the scheduler cannot observe whether the processor enters this lower
power state, and hence the model is a POPTA. We generate optimal schedulers
(minimising expected execution time or energy usage) using strategy synthesis.

Non-repudiation protocol. Our third case study is a non-repudiation protocol
for information transfer due to Markowitch & Roggeman [25]. It is designed
to allow an originator O to send information to a recipient R while
guaranteeing non-repudiation, that is, neither party can deny having participated in
the information transfer. The initialisation step of the protocol requires O to
randomly select an integer N in the range 1, ..., K that is never revealed to R
during execution. In previous analyses [23,27], modelling this step was not
possible since no notion of (non-)observability was used. We resolve this by building
a POPTA model of the protocol including this step, thus matching Markowitch
& Roggeman’s original specification. In particular, we include a hidden variable
to store the random value N. We build two models: a basic one, where R's only
malicious behaviour corresponds to stopping early; and a second, more complex
model, where R has access to a decoder. We compute the maximum probability
that R gains an unfair advantage (gains the information from O while being able
to deny participating). Our results (see Table 1) show that, for the basic model,
this probability equals 1/K and that R is more powerful in the complex model.

Experimental results. Table 1 summarises a representative set of experimental
results from the analysis of our three case studies. All were run on a 2.8
GHz PC with 8GB RAM. The table shows the parameters used for each model
(see the web page cited above for details), the property analysed and various
statistics from the analysis: the size of the POMDP obtained through the digital
clocks semantics; number of observations; number of hidden values (i.e., the
maximum number of states with the same observation); the grid size (resolution
M and total number of points); the time taken; and the results obtained. For
comparison, in the rightmost column, we show what result is obtained if the
POPTA is treated as a PTA (by making everything observable).

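On the whole, we find that the performance of our prototype is good,
especially considering the complexity of the POMDP solution methods and the
fact that we use a relatively simple grid mechanism. We are able to analyse
POPTAs whose integer semantics yields POMDPs of up to 60,000 states, with
experiments usually taking just a few seconds and, at worst, 5-6 minutes. These
are, of course, smaller than the standard PTA (or MDP) models that can be
verified, but we were still able to obtain useful results for several case studies.

Table 1. Experimental results from verification/strategy synthesis of POPTAs.
The columns from "States" to "Result (bounds)" refer to verification/strategy
synthesis of the POPTA; the last column gives the result for the PTA.

| Case study (parameters) | Params | Property | States ($[\mathsf{P}]_{\mathbb{N}}$) | Num. obs. | Num. hidd. | Res. (M) | Grid points | Time (s) | Result (bounds) | PTA result |
|---|---|---|---|---|---|---|---|---|---|---|
| pump ($h_1$ N) | 16 2 | Pmax=? [F guess] | 243 | 145 | 3 | 2 | 342 | 0.7 | [0.940, 0.992] | 1.0 |
| pump ($h_1$ N) | 16 2 | Pmax=? [F guess] | 243 | 145 | 3 | 40 | 4,845 | 4.0 | [0.940, 0.941] | 1.0 |
| pump ($h_1$ N) | 16 16 | Pmax=? [F guess] | 1,559 | 803 | 3 | 2 | 2,316 | 16.8 | [0.999, 0.999] | 1.0 |
| pump ($h_1$ N D) | 8 4 50 | Pmax=? [F≤D guess] | 12,167 | 7,079 | 3 | 2 | 17,256 | 11.0 | [0.753, 0.808] | 1.0 |
| pump ($h_1$ N D) | 8 4 50 | Pmax=? [F≤D guess] | 12,167 | 7,079 | 3 | 12 | 68,201 | 36.2 | [0.763, 0.764] | 1.0 |
| pump ($h_1$ N D) | 16 8 50 | Pmax=? [F≤D guess] | 26,019 | 13,909 | 3 | 2 | 38,130 | 52.8 | [0.501, 0.501] | 1.0 |
| pump ($h_1$ N D) | 16 8 100 | Pmax=? [F≤D guess] | 59,287 | 31,743 | 3 | 2 | 86,832 | 284.8 | [0.531, 0.532] | 1.0 |
| scheduler basic (sleep) | 0.25 | Rmin=? [F done] (exec. time) | 5,002 | 3,557 | 2 | 2 | 6,447 | 3.2 | [14.69, 14.69] | 14.44 |
| scheduler basic (sleep) | 0.5 | Rmin=? [F done] (exec. time) | 5,002 | 3,557 | 2 | 2 | 6,447 | 3.1 | [17.0, 17.0] | 16.5 |
| scheduler basic (sleep) | 0.75 | Rmin=? [F done] (exec. time) | 5,002 | 3,557 | 2 | 4 | 9,337 | 3.1 | [19.25, 19.25] | 18.5 |
| scheduler basic (sleep) | 0.25 | Rmin=? [F done] (energy cons.) | 5,002 | 3,557 | 2 | 4 | 9,337 | 3.1 | [1.335, 1.335] | 1.237 |
| scheduler basic (sleep) | 0.5 | Rmin=? [F done] (energy cons.) | 5,002 | 3,557 | 2 | 2 | 6,447 | 3.1 | [1.270, 1.270] | 1.186 |
| scheduler basic (sleep) | 0.75 | Rmin=? [F done] (energy cons.) | 5,002 | 3,557 | 2 | 2 | 6,447 | 3.2 | [1.204, 1.204] | 1.155 |
| scheduler prob (sleep) | 0.25 | Rmin=? [F done] (exec. time) | 6,987 | 5,381 | 2 | 2 | 8,593 | 5.8 | [15.00, 15.00] | 14.75 |
| scheduler prob (sleep) | 0.5 | Rmin=? [F done] (exec. time) | 6,987 | 5,381 | 2 | 2 | 8,593 | 5.8 | [17.27, 17.27] | 16.77 |
| scheduler prob (sleep) | 0.75 | Rmin=? [F done] (exec. time) | 6,987 | 5,381 | 2 | 4 | 11,805 | 5.0 | [19.52, 19.52] | 18.77 |
| scheduler prob (sleep) | 0.25 | Rmin=? [F done] (energy cons.) | 6,987 | 5,381 | 2 | 4 | 11,805 | 5.3 | [1.335, 1.335] | 1.3 |
| scheduler prob (sleep) | 0.5 | Rmin=? [F done] (energy cons.) | 6,987 | 5,381 | 2 | 2 | 8,593 | 5.0 | [1.269, 1.269] | 1.185 |
| scheduler prob (sleep) | 0.75 | Rmin=? [F done] (energy cons.) | 6,987 | 5,381 | 2 | 2 | 8,593 | 5.8 | [1.204, 1.204] | 1.155 |
| nrp basic (K) | 4 | Pmax=? [F unfair] | 365 | 194 | 5 | 8 | 5,734 | 0.8 | [0.25, 0.281] | 1.0 |
| nrp basic (K) | 4 | Pmax=? [F unfair] | 365 | 194 | 5 | 24 | 79,278 | 5.9 | [0.25, 0.25] | 1.0 |
| nrp basic (K) | 8 | Pmax=? [F unfair] | 1,273 | 398 | 9 | 4 | 23,435 | 4.8 | [0.125, 0.375] | 1.0 |
| nrp basic (K) | 8 | Pmax=? [F unfair] | 1,273 | 398 | 9 | 8 | 318,312 | 304.6 | [0.125, 0.237] | 1.0 |
| nrp complex (K) | 4 | Pmax=? [F unfair] | 1,501 | 718 | 5 | 4 | 7,480 | 2.1 | [0.438, 0.519] | 1.0 |
| nrp complex (K) | 4 | Pmax=? [F unfair] | 1,501 | 718 | 5 | 12 | 72,748 | 14.8 | [0.438, 0.438] | 1.0 |
| nrp complex (K) | 8 | Pmax=? [F unfair] | 5,113 | 1,438 | 9 | 2 | 16,117 | 6.1 | [0.344, 0.625] | 1.0 |
| nrp complex (K) | 8 | Pmax=? [F unfair] | 5,113 | 1,438 | 9 | 4 | 103,939 | 47.1 | [0.344, 0.520] | 1.0 |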

The values in the rightmost column of Table 1 illustrate that the results
obtained with POPTAs would not have been possible using a PTA model, i.e., 
where all states of the model are observable. For the pump example, the PTA 
gives probability 1 of guessing correctly (‘low’ can simply read the value of the 
secret). For the scheduler example, the PTA model gives a scheduler with better 
time/energy consumption but that cannot be implemented in practice since the 
power state is not visible. For the nrp models, the PTA gives probability 1 of 
unfairness as the recipient can read the random value the originator selects. 

Another positive aspect is that, in many cases, the bounds generated are
very close (or even equal, in which case the results are exact). For the pump and
scheduler case studies, we included results for the smallest grid resolution M
required to ensure the difference between the bounds is at most 0.001. In many
cases, this is achieved with relatively small values (for the scheduler example, in
particular, M is at most 4). For nrp models, we were unable to do this when K=8
and instead include the results for the largest grid resolution for which POMDP
solution was possible: higher values could not be handled within the memory
constraints of our test machine. We anticipate being able to improve this in the
future by adapting more advanced approximation methods for POMDPs [28].
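
The search for the smallest sufficient grid resolution can be pictured as a simple loop around steps (C) and (D). The sketch below is only illustrative: `solve` is a hypothetical callable standing in for one run of the over- and under-approximation at resolution M, and the unit increment and the `max_M` cap (a stand-in for running out of memory) are assumptions of this example.

```python
def refine_until_tight(solve, eps=0.001, start_M=2, max_M=64):
    """Increase M until the bounds from solve(M) differ by at most eps.

    solve: hypothetical callable M -> (lower, upper), representing steps
           (C) and (D) at grid resolution M.
    Returns (M, lower, upper) for the last resolution tried.
    """
    M = start_M
    while True:
        lower, upper = solve(M)
        # Stop when the bounds are tight enough, or when the (assumed)
        # resource cap is reached and the loosest acceptable result is kept.
        if upper - lower <= eps or M >= max_M:
            return M, lower, upper
        M += 1  # assumption: unit steps, so the first hit is the smallest M


if __name__ == "__main__":
    # Toy stand-in for the solver: the gap shrinks as 1/M^2.
    toy = lambda M: (0.0, 1.0 / (M * M))
    print(refine_until_tight(toy, eps=0.05))
```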

6 Conclusions 

We have proposed novel methods for verification and control of partially
observable probabilistic timed automata, using a temporal logic for probabilistic,
real-time properties and reward measures. We developed techniques based on a
digital clocks discretisation and a belief space approximation, then implemented
them in a tool and demonstrated their effectiveness on several case studies.

Future directions include more efficient approximation schemes, zone-based
implementations and development of the theory for unobservable clocks.
Allowing unobservable clocks, as mentioned previously, will require moving to partially
observable stochastic games and restricting the class of strategies.

Acknowledgments. This work was partly supported by the EPSRC grant
“Automated Game-Theoretic Verification of Security Systems” (EP/K038575/1).
We also gratefully acknowledge support from Google Summer of Code 2014.

A Proof of Thm. 1


As discussed in Sec. 3, we restrict our attention to POPTAs which reset only
non-zero clocks. Given any clock valuations $v, v'$ and distinct sets of clocks $X, Y$
such that $v(x)>0$ for any $x \in X \cup Y$, we have that $v[X{:=}0] = v'$ implies
$v[Y{:=}0] \neq v'$. Therefore, under this restriction, for a time domain $\mathbb{T}$ and
POPTA $\mathsf{P}$, if there exists a transition from $(l,v)$ to $(l',v')$ in $[\mathsf{P}]_{\mathbb{T}}$, then
there is a unique set of clocks which are reset when this transition is taken and
we define this set as follows.


Definition 12. For any clock valuations $v, v' \in \mathbb{T}^{\mathcal{X}}$ we define the set of clocks
$X_{[v \mapsto v']}$ as follows: $X_{[v \mapsto v']} = \{ x \in \mathcal{X} \mid v(x)>0 \wedge v'(x)=0 \}$.
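
To make Definition 12 concrete, the following Python sketch computes the reset set from a pair of valuations and shows why the restriction to non-zero clocks matters. Valuations are represented as dictionaries from clock names to values; the function names `reset_set` and `apply_reset` are assumptions introduced for this illustration.

```python
def reset_set(v, v_next):
    """Clocks that are positive in v and zero in v_next (Definition 12)."""
    return {x for x in v if v[x] > 0 and v_next[x] == 0}


def apply_reset(v, X):
    """The valuation v[X:=0]: clocks in X set to 0, all others unchanged."""
    return {x: (0 if x in X else val) for x, val in v.items()}


if __name__ == "__main__":
    v = {"x": 2, "y": 0, "z": 3}
    v_next = {"x": 0, "y": 0, "z": 3}
    X = reset_set(v, v_next)
    print(X, apply_reset(v, X) == v_next)
    # Resetting the already-zero clock y as well gives the same valuation,
    # so without the non-zero restriction the reset set would be ambiguous.
    print(apply_reset(v, {"x", "y"}) == v_next)
```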


Using Defn. 12, the transition function of $[\mathsf{P}]_{\mathbb{T}}$ is such that, for any
$(l,v) \in S$ and $a \in A$, we have $P((l,v),a)=\mu$ if and only if $v \models \mathit{enab}(l,a)$ and
for any $(l',v') \in S$:

$$\mu(l',v') = \begin{cases} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l') & \text{if } v[X_{[v \mapsto v']}{:=}0] = v' \\ 0 & \text{otherwise.} \end{cases}$$

Before we present the proof of Thm. 1, we require the concept of a belief PTA.

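Definition 13. Given a POPTA $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, \mathit{inv}, \mathit{enab}, \mathit{prob}, r, \mathcal{O}_L, \mathit{obs}_L)$, the
belief PTA is given by $\mathcal{B}(\mathsf{P}) = (\mathit{Dist}_{\mathit{obs}_L}, \delta_{\bar{l}}, \mathcal{X}, A, \mathit{inv}^{\mathcal{B}}, \mathit{enab}^{\mathcal{B}}, \mathit{prob}^{\mathcal{B}}, r^{\mathcal{B}})$ where

- $\mathit{Dist}_{\mathit{obs}_L} \subseteq \mathit{Dist}(L)$ where $\lambda \in \mathit{Dist}_{\mathit{obs}_L}$ if and only if for any $l, l' \in L$ such
  that $\lambda(l)>0$ and $\lambda(l')>0$ we have $\mathit{obs}_L(l) = \mathit{obs}_L(l')$;

- the invariant $\mathit{inv}^{\mathcal{B}} : \mathit{Dist}_{\mathit{obs}_L} \rightarrow CC(\mathcal{X})$ and enabling conditions
  $\mathit{enab}^{\mathcal{B}} : \mathit{Dist}_{\mathit{obs}_L} \times A \rightarrow CC(\mathcal{X})$ are such that for $\lambda \in \mathit{Dist}_{\mathit{obs}_L}$ and $a \in A$ we
  have $\mathit{inv}^{\mathcal{B}}(\lambda) = \mathit{inv}(l)$ and $\mathit{enab}^{\mathcal{B}}(\lambda, a) = \mathit{enab}(l, a)$ where $l \in L$ is such that
  $\lambda(l)>0$;

- the probabilistic transition function
  $\mathit{prob}^{\mathcal{B}} : \mathit{Dist}_{\mathit{obs}_L} \times A \rightarrow \mathit{Dist}(2^{\mathcal{X}} \times \mathit{Dist}_{\mathit{obs}_L})$ is such that for any
  $\lambda, \lambda' \in \mathit{Dist}_{\mathit{obs}_L}$, $a \in A$ and $X \subseteq \mathcal{X}$ we have:

  $$\mathit{prob}^{\mathcal{B}}(\lambda,a)(X,\lambda') = \sum_{l \in L} \lambda(l) \cdot \Big( \sum_{\substack{o \in \mathcal{O}_L \\ \lambda^{a,o,X} = \lambda'}} \ \sum_{\substack{l' \in L \\ \mathit{obs}_L(l') = o}} \mathit{prob}(l,a)(X,l') \Big)$$

  where for any $l' \in L$ we have

  $$\lambda^{a,o,X}(l') = \begin{cases} \dfrac{\sum_{l \in L} \mathit{prob}(l,a)(X,l') \cdot \lambda(l)}{\sum_{l \in L} \lambda(l) \cdot \big( \sum_{l'' \in L \wedge \mathit{obs}_L(l'')=o} \mathit{prob}(l,a)(X,l'') \big)} & \text{if } \mathit{obs}_L(l') = o \\ 0 & \text{otherwise;} \end{cases}$$

- the reward structure $r^{\mathcal{B}} = (r^{\mathcal{B}}_L, r^{\mathcal{B}}_A)$ is such that

  $$r^{\mathcal{B}}_L(\lambda) = \sum_{l \in L} \lambda(l) \cdot r_L(l) \quad \text{and} \quad r^{\mathcal{B}}_A(\lambda, a) = \sum_{l \in L} \lambda(l) \cdot r_A(l, a).$$

For the above to be well defined we require the conditions of the invariant and
observation function given in Defn. 6. For any $\lambda \in \mathit{Dist}_{\mathit{obs}_L}$ we let $o_\lambda$ be the
unique observation such that $\mathit{obs}_L(l) = o_\lambda$ and $\lambda(l) > 0$ for some $l \in L$.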

Next we introduce the semantics of a PTA which, similarly to Defn. 7, is
parameterised by a time domain $\mathbb{T}$.

Definition 14 (PTA semantics). Let $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, \mathit{inv}, \mathit{enab}, \mathit{prob}, r)$ be a
probabilistic timed automaton. The semantics of $\mathsf{P}$ with respect to the time
domain $\mathbb{T}$ is the MDP $[\mathsf{P}]_{\mathbb{T}} = (S, \bar{s}, A \cup \mathbb{T}, P, R)$ such that:

- $S = \{ (l, v) \in L \times \mathbb{T}^{\mathcal{X}} \mid v \models \mathit{inv}(l) \}$ and $\bar{s} = (\bar{l}, \mathbf{0})$;

- for any $(l, v) \in S$ and $a \in A \cup \mathbb{T}$, we have $P((l, v), a) = \mu$ if and only if one
  of the following conditions holds:

  • (time transitions) $a \in \mathbb{T}$, $\mu = \delta_{(l,v+a)}$ and $v + t' \models \mathit{inv}(l)$ for all $0 \leq t' \leq a$;

  • (action transition) $a \in A$, $v \models \mathit{enab}(l,a)$ and for $(l',v') \in S$:

    $$\mu(l', v') = \sum_{X \subseteq \mathcal{X} \wedge v' = v[X:=0]} \mathit{prob}(l, a)(X, l')$$

- for any $(l, v) \in S$ and $a \in A \cup \mathbb{T}$:

  $$R((l, v), a) = \begin{cases} r_L(l) \cdot a & \text{if } a \in \mathbb{T} \\ r_A(l, a) & \text{if } a \in A. \end{cases}$$

We now show that for a POPTA $\mathsf{P}$, the semantics of its belief PTA is equivalent
to the belief MDP of the semantics of $\mathsf{P}$.

Proposition 1. For any POPTA $\mathsf{P}$ which resets only non-zero clocks and time
domain $\mathbb{T}$, we have that the MDPs $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ and $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ are equivalent.

Proof. Consider any POPTA $\mathsf{P} = (L, \bar{l}, \mathcal{X}, A, \mathit{inv}, \mathit{enab}, \mathit{prob}, r, \mathcal{O}_L, \mathit{obs}_L)$ which
resets only non-zero clocks and time domain $\mathbb{T}$. To show the MDPs $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ and
$\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ are equivalent we will first give a direct correspondence between their
state spaces and then show that under this correspondence both the transition
and reward functions are equivalent.

Considering the belief MDP $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ and using the fact that
$\mathit{obs}(l,v) = (\mathit{obs}_L(l), v)$, for any belief states $b, b'$ and action $a$ we have
$P^{\mathcal{B}}(b,a)(b')$ equals

$$\sum_{\substack{(o,v_o) \in \mathcal{O}_L \times \mathbb{T}^{\mathcal{X}} \\ b^{a,(o,v_o)} = b'}} \ \sum_{(l,v) \in S} b(l,v) \cdot \Big( \sum_{\substack{l' \in L \\ \mathit{obs}_L(l') = o}} P((l,v),a)(l',v_o) \Big)$$

where for any belief $b$, action $a$, observation $(o,v_o)$ and state $(l',v')$, we have
$b^{a,(o,v_o)}(l',v')$ equals

$$\begin{cases} \dfrac{\sum_{(l,v) \in S} P((l,v),a)(l',v') \cdot b(l,v)}{\sum_{(l,v) \in S} b(l,v) \cdot \big( \sum_{l'' \in L \wedge \mathit{obs}_L(l'')=o} P((l,v),a)(l'',v_o) \big)} & \text{if } \mathit{obs}_L(l') = o \text{ and } v' = v_o \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

and $R^{\mathcal{B}}(b,a) = \sum_{(l,v) \in S} R((l,v),a) \cdot b(l,v)$. Furthermore, by Defn. 7 and since
in $\mathsf{P}$ only non-zero clocks are reset, if $a \in A$:

$$P((l,v),a)(l',v') = \begin{cases} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l') & \text{if } v[X_{[v \mapsto v']}{:=}0] = v' \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

while if $a \in \mathbb{T}$:

$$P((l,v),a)(l',v') = \begin{cases} 1 & \text{if } l' = l \text{ and } v' = v + a \\ 0 & \text{otherwise.} \end{cases} \qquad (3)$$

We see that $b^{a,(o,v_o)}(l',v')$ is zero if $v' \neq v_o$, and therefore we can write the
belief as $(\lambda, v_o)$ where $\lambda \in \mathit{Dist}(L)$ and $\lambda(l) = b^{a,(o,v_o)}(l, v_o)$ for all $l \in L$. In
addition, for any $l' \in L$, if $\lambda(l')>0$, then $\mathit{obs}_L(l') = o$. Since the initial belief
$\bar{b}$ can be written as $(\delta_{\bar{l}}, \mathbf{0})$ and we assume $\mathit{obs}_L(\bar{l}) \neq \mathit{obs}_L(l)$ for any
$l \neq \bar{l} \in L$, it follows that we can write each belief $b$ of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ as a tuple
$(\lambda, v) \in \mathit{Dist}(L) \times \mathbb{T}^{\mathcal{X}}$ such that for any $l, l' \in L$, if $\lambda(l)>0$ and $\lambda(l')>0$, then
$\mathit{obs}_L(l) = \mathit{obs}_L(l')$. Hence, we have shown that the states of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ are
equivalent to the states of $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$.

We now use this representation for the states of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ to rewrite the above
equations for the transition and reward function of the belief MDP $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$. We
have the following two cases to consider.

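- For any belief states $(\lambda, v)$ and $(\lambda', v')$ and action $a \in A$:

  $$\begin{aligned} P^{\mathcal{B}}((\lambda,v),a)(\lambda',v') &= \sum_{\substack{o \in \mathcal{O}_L \\ \lambda^{a,(o,v')} = \lambda'}} \ \sum_{l \in L} \lambda(l) \cdot \Big( \sum_{\substack{l' \in L \\ \mathit{obs}_L(l') = o}} P((l,v),a)(l',v') \Big) \\ &= \sum_{\substack{o \in \mathcal{O}_L \\ \lambda^{a,(o,v')} = \lambda'}} \ \sum_{l \in L} \lambda(l) \cdot \Big( \sum_{\substack{l' \in L \\ \mathit{obs}_L(l') = o}} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l') \Big) && \text{by (2)} \\ &= \sum_{l \in L} \lambda(l) \cdot \Big( \sum_{\substack{o \in \mathcal{O}_L \\ \lambda^{a,(o,v')} = \lambda'}} \ \sum_{\substack{l' \in L \\ \mathit{obs}_L(l') = o}} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l') \Big) && \text{rearranging} \end{aligned}$$

  where for any $l' \in L$:

  $$\begin{aligned} \lambda^{a,(o,v')}(l') &= \begin{cases} \dfrac{\sum_{l \in L} P((l,v),a)(l',v') \cdot \lambda(l)}{\sum_{l \in L} \lambda(l) \cdot \big( \sum_{l'' \in L \wedge \mathit{obs}_L(l'')=o} P((l,v),a)(l'',v') \big)} & \text{if } \mathit{obs}_L(l') = o \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \dfrac{\sum_{l \in L} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l') \cdot \lambda(l)}{\sum_{l \in L} \lambda(l) \cdot \big( \sum_{l'' \in L \wedge \mathit{obs}_L(l'')=o} \mathit{prob}(l,a)(X_{[v \mapsto v']}, l'') \big)} & \text{if } \mathit{obs}_L(l') = o \\ 0 & \text{otherwise} \end{cases} && \text{by (2)} \\ &= \lambda^{a,o,X_{[v \mapsto v']}}(l') && \text{by Defn. 13.} \end{aligned}$$

  Using this result together with Defn. 13 and Defn. 14 it follows that the
  transition functions of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ and $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ are equivalent in this case.
  Considering the reward functions, we have
  $R^{\mathcal{B}}((\lambda,v),a) = \sum_{l \in L} r_A(l,a) \cdot \lambda(l)$ which, again from Defn. 13 and Defn. 14,
  shows that the reward functions of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ and $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ are equivalent.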

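- For any belief states $(\lambda, v)$ and $(\lambda', v')$ and time duration $t \in \mathbb{T}$:

  $$P^{\mathcal{B}}((\lambda,v),t)(\lambda',v') = \begin{cases} \sum_{l \in L} \lambda(l) \cdot P((l,v),t)(l,v') & \text{if } \lambda^{t,(o_\lambda,v')} = \lambda' \\ 0 & \text{otherwise} \end{cases}$$

  where for any $l' \in L$:

  $$\begin{aligned} \lambda^{t,(o_\lambda,v')}(l') &= \begin{cases} \dfrac{\lambda(l')}{\sum_{l \in L} \lambda(l)} & \text{if } v' = v + t \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \lambda(l') & \text{if } v' = v + t \\ 0 & \text{otherwise} \end{cases} && \text{since } \lambda \text{ is a distribution.} \end{aligned}$$

  Substituting this expression for $\lambda^{t,(o_\lambda,v')}$ into that of $P^{\mathcal{B}}((\lambda,v),t)$ we have:

  $$\begin{aligned} P^{\mathcal{B}}((\lambda,v),t)(\lambda',v') &= \begin{cases} \sum_{l \in L} \lambda(l) \cdot \big( \sum_{l' \in L} P((l,v),t)(l',v') \big) & \text{if } \lambda = \lambda' \text{ and } v' = v + t \\ 0 & \text{otherwise} \end{cases} \\ &= \begin{cases} \sum_{l \in L} \lambda(l) & \text{if } \lambda = \lambda' \text{ and } v' = v + t \\ 0 & \text{otherwise} \end{cases} && \text{by (3)} \\ &= \begin{cases} 1 & \text{if } \lambda = \lambda' \text{ and } v' = v + t \\ 0 & \text{otherwise} \end{cases} && \text{since } \lambda \text{ is a distribution} \end{aligned}$$

  which from Defn. 13 and Defn. 14 shows the transition functions of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$
  and $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ are equivalent. For the reward function, we have
  $R^{\mathcal{B}}((\lambda,v),t) = \sum_{l \in L} (r_L(l) \cdot t) \cdot \lambda(l)$ and from Defn. 13 and Defn. 14 this implies
  that the reward functions of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ and $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ again are equivalent.

Since these are the only cases to consider, both the transition and reward
functions of $\mathcal{B}([\mathsf{P}]_{\mathbb{T}})$ and $[\mathcal{B}(\mathsf{P})]_{\mathbb{T}}$ are equivalent, completing the proof. □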


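We are now in a position to present the proof of Thm. 1.

Proof (of Thm. 1). Consider any closed diagonal-free POPTA $\mathsf{P}$ which resets
only non-zero clocks and set of observations $O$ of $\mathsf{P}$. Since the PTA $\mathcal{B}(\mathsf{P})$ is
closed and diagonal-free, using results presented in [20], we have that:

$$\mathit{Pr}^{opt}_{[\mathcal{B}(\mathsf{P})]_{\mathbb{R}}}(\mathtt{F}\ T_O) = \mathit{Pr}^{opt}_{[\mathcal{B}(\mathsf{P})]_{\mathbb{N}}}(\mathtt{F}\ T_O) \quad \text{and} \quad \mathbb{E}^{opt}_{[\mathcal{B}(\mathsf{P})]_{\mathbb{R}}}(\mathtt{F}\ T_O) = \mathbb{E}^{opt}_{[\mathcal{B}(\mathsf{P})]_{\mathbb{N}}}(\mathtt{F}\ T_O) \qquad (4)$$

where $opt \in \{\min, \max\}$ and $T_O = \{ (\lambda, v) \mid o_\lambda \in O \}$. Note that, although [20]
considers only PTAs with finite locations, the corresponding proofs do not
require this fact, and hence the results carry over to $\mathcal{B}(\mathsf{P})$ which has an
uncountable number of locations.

Due to the equivalence between a POMDP and its belief MDP we have:

$$\mathit{Pr}^{opt}_{[\mathsf{P}]_{\mathbb{T}}}(\mathtt{F}\ O) = \mathit{Pr}^{opt}_{\mathcal{B}([\mathsf{P}]_{\mathbb{T}})}(\mathtt{F}\ T_O) \quad \text{and} \quad \mathbb{E}^{opt}_{[\mathsf{P}]_{\mathbb{T}}}(\mathtt{F}\ O) = \mathbb{E}^{opt}_{\mathcal{B}([\mathsf{P}]_{\mathbb{T}})}(\mathtt{F}\ T_O). \qquad (5)$$

Using Proposition 1 and since $\mathsf{P}$ resets only non-zero clocks we have
$[\mathcal{B}(\mathsf{P})]_{\mathbb{T}} = \mathcal{B}([\mathsf{P}]_{\mathbb{T}})$. Combining this with (4) and (5) the theorem follows. □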


B Construction of the Belief MDP 

For the convenience of the reviewers, below we include, for a given POMDP $\mathsf{M}$,
the construction of the corresponding belief MDP $\mathcal{B}(\mathsf{M})$.

First, let us suppose we are in a belief state $b$, perform action $a$ and observe
$o$. Based on this new information we move to a new belief state $b^{a,o}$ using the
observation function of $\mathsf{M}$. Let us now construct this new belief state. First, by
construction, we have for any $s' \in S$:

$$\begin{aligned} b^{a,o}(s') &= \Pr[s' \mid o, a, b] \\ &= \frac{\Pr[s', o, a, b]}{\Pr[o, a, b]} && \text{by definition of conditional probabilities} \\ &= \begin{cases} \dfrac{\Pr[s', a, b]}{\Pr[o, a, b]} & \text{if } \mathit{obs}(s') = o \\ 0 & \text{otherwise} \end{cases} && \text{by definition of } \mathit{obs}. \end{aligned}$$

Now considering the numerator in the first case, since the value of $s'$ is dependent
on the other values:

$$\begin{aligned} \Pr[s', a, b] &= \Pr[s' \mid a, b] \cdot \Pr[a, b] \\ &= \Pr[s' \mid a, b] \cdot 1 && \text{since } b \text{ and } a \text{ are fixed} \\ &= \sum_{s \in S} \Pr[s' \mid a, s] \cdot b(s) && \text{by definition of } b \\ &= \sum_{s \in S} P(s,a)(s') \cdot b(s) && \text{by definition of } P. \end{aligned}$$

For the denominator, since $o$ is dependent on $b$ and $a$ we have:

$$\begin{aligned} \Pr[o, a, b] &= \Pr[o \mid a, b] \cdot \Pr[a, b] \\ &= \Pr[o \mid a, b] \cdot 1 && \text{since } b \text{ and } a \text{ are fixed} \\ &= \sum_{s \in S} \Pr[o \mid a, s] \cdot b(s) && \text{by definition of } b \\ &= \sum_{s \in S} \Big( \sum_{s' \in S} \Pr[o \mid s', a] \cdot P(s,a)(s') \Big) \cdot b(s) && \text{by definition of } P \\ &= \sum_{s \in S} \Big( \sum_{s' \in S \wedge \mathit{obs}(s') = o} P(s,a)(s') \Big) \cdot b(s) && \text{by definition of } \mathit{obs}. \end{aligned}$$

Combining these results (and rearranging) we have:

$$b^{a,o}(s') = \begin{cases} \dfrac{\sum_{s \in S} P(s,a)(s') \cdot b(s)}{\sum_{s \in S} b(s) \cdot \big( \sum_{s'' \in S \wedge \mathit{obs}(s'') = o} P(s,a)(s'') \big)} & \text{if } \mathit{obs}(s') = o \\ 0 & \text{otherwise.} \end{cases}$$

Now using this we can define the probabilistic transition function of the belief
MDP $\mathcal{B}(\mathsf{M})$. Suppose we are in a belief state $b$ and we perform action $a$. Now
the probability we move to belief $b'$ is given by:

$$P^{\mathcal{B}}(b,a)(b') = \Pr[b' \mid a, b] = \sum_{o \in \mathcal{O}} \Pr[b' \mid o, a, b] \cdot \Pr[o \mid a, b].$$

The first term in the summation ($\Pr[b' \mid o, a, b]$) is the probability of being in
belief $b'$ after being in belief $b$, performing action $a$ and observing $o$. Therefore
by definition of $b^{a,o}$, this probability will equal 1 if $b'$ equals $b^{a,o}$ and 0 otherwise.

For the second term, as in the derivation of the denominator above, we have:

$$\Pr[o \mid a, b] = \sum_{s \in S} b(s) \cdot \Big( \sum_{s' \in S \wedge \mathit{obs}(s') = o} P(s,a)(s') \Big).$$

This completes the construction of the transition function of the belief MDP.

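It remains to construct the reward function $R^{\mathcal{B}}$ of the belief MDP. In this case,
we just have to take the expectation of the reward with respect to the current
belief, i.e. for any action $a$ and belief state $b$:

$$R^{\mathcal{B}}(b,a) = \sum_{s \in S} R(s,a) \cdot b(s).$$

The optimal values for the belief MDP equal those for the POMDP, e.g. we have:

$$\mathit{Pr}^{\max}_{\mathsf{M}}(\mathtt{F}\ O) = \mathit{Pr}^{\max}_{\mathcal{B}(\mathsf{M})}(\mathtt{F}\ T_O) \quad \text{and} \quad \mathbb{E}^{\max}_{\mathsf{M}}(\mathtt{F}\ O) = \mathbb{E}^{\max}_{\mathcal{B}(\mathsf{M})}(\mathtt{F}\ T_O)$$

where $T_O = \{ b \in \mathit{Dist}(S) \mid \forall s \in S.\ (b(s)>0 \rightarrow \mathit{obs}(s) \in O) \}$.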
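
The construction in this appendix can be sketched in Python for a finite POMDP. This is an illustrative sketch only: the data layout (nested dictionaries `P[s][a][s2]` for the transition function, `obs[s]` for the observation function and `R[s][a]` for rewards) and all function names are assumptions of this example, not part of the tool.

```python
from collections import defaultdict


def belief_update(b, a, o, P, obs):
    """Return (b^{a,o}, Pr[o | a, b]); the belief is None if Pr[o | a, b] = 0."""
    unnorm = defaultdict(float)
    for s, bs in b.items():
        for s2, p in P[s][a].items():
            if obs[s2] == o:  # only successors consistent with o contribute
                unnorm[s2] += p * bs
    pr_o = sum(unnorm.values())  # the denominator derived above
    if pr_o == 0:
        return None, 0.0
    return {s2: w / pr_o for s2, w in unnorm.items()}, pr_o


def belief_successors(b, a, P, obs):
    """List of (successor belief, probability) pairs describing P^B(b, a).

    Beliefs for distinct observations have disjoint supports, so each
    observation contributes its own successor belief.
    """
    observations = {obs[s2] for s in b for s2 in P[s][a]}
    result = []
    for o in sorted(observations):
        b2, pr_o = belief_update(b, a, o, P, obs)
        if b2 is not None:
            result.append((b2, pr_o))
    return result


def belief_reward(b, a, R):
    """Expected reward R^B(b, a) with respect to the current belief."""
    return sum(R[s][a] * bs for s, bs in b.items())


if __name__ == "__main__":
    # Two hidden states with the same observation "h"; toy numbers.
    P = {"s0": {"a": {"u0": 0.8, "w0": 0.2}},
         "s1": {"a": {"u1": 0.4, "w1": 0.6}}}
    obs = {"s0": "h", "s1": "h", "u0": "x", "u1": "x", "w0": "y", "w1": "y"}
    R = {"s0": {"a": 1.0}, "s1": {"a": 3.0}}
    b = {"s0": 0.5, "s1": 0.5}
    for b2, p in belief_successors(b, "a", P, obs):
        print(b2, p)
    print(belief_reward(b, "a", R))
```

In the toy model, observing "x" shifts the belief towards the successor of s0, since s0 is twice as likely as s1 to produce that observation.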



